INTRODUCTION

Each group was given separate train and test data sets with 60 features, consisting of 46 binary and 14 continuous variables, and a binary target variable. The aim of the project is to build a classification model that predicts the unknown target values of the test data as close as possible to their actual values in terms of balanced error rate and area under the ROC curve. As a group, we propose the approach given below:
• First, the train and test data were merged into a single data set so that the manipulation steps could be performed once.
• All features were checked for skewness and outliers.
• The distribution and value counts of each feature were inspected, and highly imbalanced features (x50, x52) were discarded.
• A logarithm transformation was applied to skewed and non-normal features.
• The MinMaxScaler method was applied to the features considered continuous (x1, x8, x9, x10, x11).
• Dummy encoding was applied to the features considered nominal categorical (x5, x6, x7).
• Relatively important features were detected with the correlation matrix, the random forest algorithm, SelectKBest, the mutual information classifier, and a variance threshold of 0.05 or 0.1.
• Features whose minority value appears in less than 5% of the rows were discarded.
• The data was then separated back into its original train and test form.
• The SMOTENC method was applied to reduce class imbalance by generating data in the minority class (it handles both numerical and categorical features).
• Classification methods were selected according to their cross-validation scores and classification reports.
• Mainly, Random Forest and Gradient Boosting classification models were used for training and prediction.
• Using grid search with cross-validation, the parameters of each algorithm were tuned.
• Performance metrics such as cross-validation score, roc_auc score, and balanced error rate were evaluated to choose the best model.

LITERATURE

Skewness measures how much the probability distribution of a random variable deviates from the normal distribution. For example, if data is right-skewed, most of its data points have low values, with a long tail toward high values. A model trained on such data will therefore perform better at predicting the lower values, which biases the predictions. Since most algorithms work better under a normality assumption, a logarithm transformation can be applied to the data to move it toward normality and increase the validity of the associated statistical analyses.
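As a small illustration of this idea (with synthetic data standing in for the project's features), the snippet below computes the sample skewness of a right-skewed variable before and after a log transformation:

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # heavily right-skewed

before = skewness(x)           # large positive skew
after = skewness(np.log1p(x))  # log1p pulls in the long right tail
```

The transformed variable is far closer to symmetric, which is exactly the effect exploited in the preprocessing step.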

“Explorations in statistics: the log transformation.” Curran-Everett, D. (2018)

According to Curran-Everett (2018), if the variability—the standard deviation—varies in rough proportion to the mean value of Y, a log transformation of the actual observations can equalize the standard deviations. Briefly, a log transformation can be used to stabilize the variance. It can thus make the distribution of the observations themselves more normal, which in turn means the theoretical distribution of the sample mean becomes more normal.
This suggestion in the paper inspired us and we tried logarithm transformation for skewed features.

The related paper can be found here.

“SMOTE for high-dimensional class-imbalanced data.” Rok Blagus & Lara Lusa. (2013)

According to Blagus and Lusa (2013), classification using class-imbalanced data is biased in favor of the majority class. The bias can increase for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be reduced by undersampling or oversampling so as to produce class-balanced data. In most cases undersampling is helpful, while random oversampling is not. The Synthetic Minority Oversampling Technique (SMOTE) is an oversampling method that was proposed to improve random oversampling. Even though it is a quite useful method, its behavior on high-dimensional data has not been thoroughly investigated.
In this paper, authors investigated the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.
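The core of SMOTE is simple: each synthetic minority sample is a convex combination of a minority sample and one of its nearest minority neighbors. A minimal numpy sketch of that interpolation step (not the full imbalanced-learn implementation) might look like this:

```python
import numpy as np

def smote_sample(X_min, n_new, k=3, rng=None):
    """Generate synthetic minority samples by SMOTE-style interpolation."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances to find each point's k nearest minority neighbours
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    out = []
    for _ in range(n_new):
        i = rng.integers(n)            # random minority sample
        j = nn[i, rng.integers(k)]     # one of its k nearest neighbours
        u = rng.random()               # convex weight in [0, 1)
        out.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class.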

The related paper can be found here.

“Arrhythmia Disease Classification and Mobile Based System Design.” Soha Samir AbdElMoneem, Hany Hanafy Said, and Amani Anwar Saad (2020)

According to AbdElMoneem et al. (2020), the research examined four different oversampling techniques: the Synthetic Minority Oversampling Technique (SMOTE), random oversampling, Adaptive Synthetic sampling (ADASYN), and the Synthetic Minority Oversampling Technique for Nominal and Continuous features (SMOTENC). SMOTE is considered a powerful sampling method because it creates new instances of the minority class as convex combinations of neighboring instances. The other oversampling methods are explained as follows. Random oversampling repeats some samples of the under-represented classes to balance the number of samples in the data set, making the result less biased toward the majority class. ADASYN also generates new samples by interpolation, but differs in the samples it uses to generate them: it focuses on generating samples next to the original samples that are mistakenly classified by a K-Nearest Neighbors classifier, while the basic implementation of SMOTE makes no distinction between samples that are easy or hard to classify under the nearest-neighbors rule. Therefore, the decision functions found during training are expected to differ among the algorithms. SMOTENC is an extension of the SMOTE algorithm in which categorical features are handled differently. This paper encouraged us to use SMOTENC since we have both numerical and categorical data.
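SMOTENC's twist on plain SMOTE is that continuous columns are interpolated while categorical columns are set to the most frequent value among the nearest neighbors. The sketch below illustrates that idea in plain numpy (it is a simplified illustration of the algorithm applied to minority-class rows, not the imbalanced-learn implementation; distances here use only the continuous columns):

```python
import numpy as np
from collections import Counter

def smotenc_sample(X, cat_idx, n_new, k=3, rng=None):
    """SMOTENC-style oversampling sketch: interpolate continuous columns,
    set categorical columns to the majority value among the k neighbours."""
    rng = np.random.default_rng(rng)
    cont_idx = [c for c in range(X.shape[1]) if c not in cat_idx]
    Xc = X[:, cont_idx].astype(float)
    d = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        j = nn[i, rng.integers(k)]
        s = X[i].astype(float)
        u = rng.random()
        # continuous part: convex combination, exactly as in SMOTE
        s[cont_idx] = X[i, cont_idx] + u * (X[j, cont_idx] - X[i, cont_idx])
        # categorical part: majority vote among the k nearest neighbours
        for c in cat_idx:
            s[c] = Counter(X[nn[i], c]).most_common(1)[0][0]
        out.append(s)
    return np.array(out)
```

The majority vote guarantees that synthetic categorical values are always existing categories, never meaningless interpolated codes.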

The related paper can be found here.

APPROACH

At the very beginning, the data was merged so that we could interpret it as a whole. Each feature was plotted as a pairplot and checked for normality, since it is known that most algorithms work better under a normality assumption. Skewness was therefore examined to see whether a feature shows right- or left-skewed behavior, and a logarithm transformation was applied to move it toward normality. Additionally, outliers and value dominance were inspected in the histogram of each column, and a correlation matrix was created to see whether features are correlated with each other.
Discarded due to value dominance: {x19, x20, x21, x22, x29, x31, x33, x34, x35, x36, x45, x50, x52, x55, x57, x59 and x60}
MinMaxScaler due to wide range: {x1, x8, x9, x10 and x11}
Discarded due to high correlation: {x37, x46, x49, and x26}
Log1p due to skewness correction and non-normality: {x14, x27, x30, x32, x42}
Dummy encoding: {x5, x6 and x7}. Since these features consist of discrete integer values in the 0 to 18 interval, they were considered nominal classes and one-hot encoding was applied to these features.
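The three transformations listed above can be sketched with pandas on a toy frame (the feature names follow the report, but the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real features (values are invented)
df = pd.DataFrame({
    "x1": [10.0, 2000.0, 55.0, 730.0],  # wide-range continuous feature
    "x14": [1.0, 3.0, 200.0, 9.0],      # right-skewed feature
    "x5": [0, 3, 17, 3],                # nominal codes in the 0-18 interval
})

# skewness correction with log1p
df["x14"] = np.log1p(df["x14"])

# min-max scaling to [0, 1], the same mapping MinMaxScaler applies
df["x1"] = (df["x1"] - df["x1"].min()) / (df["x1"].max() - df["x1"].min())

# dummy (one-hot) encoding for the nominal feature
df = pd.get_dummies(df, columns=["x5"], prefix="x5")
```

After this step each nominal code becomes its own 0/1 column (x5_0, x5_3, x5_17 here), so no artificial ordering is imposed on the categories.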

The SMOTENC method was used to generate data in the minority class and provide class balance. Since we cannot obtain a test score from the original test data, we created artificial validation data from the original training data: the training portion of the preprocessed merged data was split into a train set (0.8 proportion) and a validation set (0.2 proportion).
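The 0.8/0.2 split can be done with scikit-learn's train_test_split; stratifying on the target keeps the class ratio identical in both parts, which matters when the classes are imbalanced (the data below is a synthetic stand-in):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed training data
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)  # imbalanced target

# stratify=y preserves the 70/30 class ratio in both splits
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```

The held-out (X_val, y_val) pair then plays the role of the artificial validation data described above.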

The following classification methods were evaluated by cross-validation score and by the classification report computed on the artificially created validation data: logistic regression, decision tree, random forest, gradient boosting, support vector machine, and stochastic gradient descent. Random Forest and Gradient Boosting were then chosen for further classification studies.
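A comparison of this kind can be sketched with cross_val_score; for brevity only three of the listed models are shown, and make_classification stands in for the project data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed training data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
}

# mean 5-fold ROC AUC per model, the kind of score used for selection
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
```

Ranking the models by these mean scores (together with the validation-set classification reports) is what led to keeping Random Forest and Gradient Boosting.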

The original training and test data sets were prepared to predict the target variable with each model. Random forest was chosen among these algorithms since the best submission performance was achieved in terms of mean cross-validation score, balanced error rate, and roc_auc score by tuning the max_depth, min_samples_leaf, and n_estimators parameters with k-folded grid search. Even though random forest had higher cross-validation scores, the confusion matrix suggested overfitting on the training data. However, parameter tuning reduced the overfitting, and the confusion matrix then gave more sensible results.
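The tuning step can be sketched with GridSearchCV over the three parameters named above (the grid values and data here are illustrative, not the exact grid used in the project):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid over the three tuned parameters
param_grid = {
    "max_depth": [5, 15],
    "min_samples_leaf": [1, 2],
    "n_estimators": [50, 200],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
# search.best_params_ holds the winning combination,
# search.best_score_ its mean cross-validated ROC AUC
```

Restricting max_depth and raising min_samples_leaf both limit tree complexity, which is why tuning them reduced the overfitting observed with the default settings.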

RESULTS

With GridSearchCV, the best parameters were found to be {'max_depth': 15, 'min_samples_leaf': 2, 'n_estimators': 200}.

The scores achieved on the artificial validation data:
CV ROC AUC Score: 0.9534
CV Accuracy Score: 0.8682
Confusion Matrix:
[[271  58]
 [ 35 262]]

Submission scores achieved on the original test data:
Auc_score: 0.9122
Ber_score: 0.8177
Score: 0.8649
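For reference, the balanced accuracy and balanced error rate follow directly from a confusion matrix; the computation below uses the validation matrix reported above (how the submission scores combine these quantities is up to the competition's scorer):

```python
import numpy as np

# Validation confusion matrix from above: rows = true class, cols = predicted
cm = np.array([[271, 58],
               [35, 262]])

# per-class recall: correct predictions divided by the true count of each class
recalls = cm.diagonal() / cm.sum(axis=1)

balanced_accuracy = recalls.mean()  # mean of the per-class recalls
ber = 1.0 - balanced_accuracy       # balanced error rate
```

Averaging per-class recalls means both classes weigh equally in the score, regardless of how imbalanced the data is.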

CONCLUSION

For each algorithm, it was hard to determine a parameter set that tunes the model well. At first we used a larger parameter grid with varied values; however, Random Forest tended to overfit the original training data. By considering the relationship among the max_depth, min_samples_leaf, and n_estimators parameters, we heuristically reduced the variety of the parameter grid and exploited the opposing effects of these parameters. As a result, we observed better scores and better predictions.

REFERENCES

Curran-Everett, D. (2018). Explorations in statistics: the log transformation. Advances in physiology education, 42(2), 343-347.

AbdElMoneem, S. S., Said, H. H., & Saad, A. A. (2020). Arrhythmia Disease Classification and Mobile Based System Design. In Journal of Physics: Conference Series (Vol. 1447, No. 1, p. 012014). IOP Publishing.

Blagus, R., & Lusa, L. (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14(1), 106.

Correlation Matrix

PairPlot for Continuous Features

Threshold

Select K Best

Feature Importance with Random Forest Classifier

Data Resampling with SMOTENC

Train Test Split of Original Train Data for Validation Purpose

Parameter Tuning

Gradient Boosting Model

Training Gradient Boosting Model for Submission

Random Forest

Training Random Forest Model for Submission